What is Scikit-Learn (Sklearn)
Scikit-learn (Sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
Origin of Scikit-Learn
It was originally called scikits.learn and was initially developed by David Cournapeau as a Google Summer of Code project in 2007. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, and Vincent Michel, from INRIA (the French Institute for Research in Computer Science and Automation), took the project to another level and made the first public release (v0.1 beta) on 1st Feb. 2010.
Let’s have a look at its version history −
May 2019: scikit-learn 0.21.0
March 2019: scikit-learn 0.20.3
December 2018: scikit-learn 0.20.2
November 2018: scikit-learn 0.20.1
September 2018: scikit-learn 0.20.0
July 2018: scikit-learn 0.19.2
July 2017: scikit-learn 0.19.0
September 2016: scikit-learn 0.18.0
November 2015: scikit-learn 0.17.0
March 2015: scikit-learn 0.16.0
July 2014: scikit-learn 0.15.0
August 2013: scikit-learn 0.14
Community & contributors
Scikit-learn is a community effort and anyone can contribute to it. The project is hosted on https://github.com/scikit-learn/scikit-learn. The following people are currently the core contributors to Sklearn’s development and maintenance −
Joris Van den Bossche (Data Scientist)
Thomas J Fan (Software Developer)
Alexandre Gramfort (Machine Learning Researcher)
Olivier Grisel (Machine Learning Expert)
Nicolas Hug (Associate Research Scientist)
Andreas Mueller (Machine Learning Scientist)
Hanmin Qin (Software Engineer)
Adrin Jalali (Open Source Developer)
Nelle Varoquaux (Data Science Researcher)
Roman Yurchak (Data Scientist)
Various organisations like Booking.com, JP Morgan, Evernote, Inria, AWeber, Spotify and many more use Sklearn.
Prerequisites
Before we start using the latest release of scikit-learn, we require the following −
Python (>=3.5)
NumPy (>= 1.11.0)
SciPy (>= 0.17.0)
Joblib (>= 0.11)
Matplotlib (>= 1.5.1) is required for Sklearn plotting capabilities.
Pandas (>= 0.18.0) is required for some of the scikit-learn examples that use data structures and analysis.
Installation
If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn −
Using pip
The following command can be used to install scikit-learn via pip −
pip install -U scikit-learn
Using conda
The following command can be used to install scikit-learn via conda −
conda install scikit-learn
On the other hand, if NumPy and SciPy are not yet installed on your Python workstation, you can install them by using either pip or conda.
Another option is to use a Python distribution like Canopy or Anaconda, as both ship the latest version of scikit-learn.
Features
Rather than focusing on loading, manipulating and summarising data, Scikit-learn library is focused on modeling the data. Some of the most popular groups of models provided by Sklearn are as follows −
Supervised Learning algorithms − Almost all the popular supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree etc., are part of scikit-learn.
Unsupervised Learning algorithms − On the other hand, it also has all the popular unsupervised learning algorithms from clustering, factor analysis, PCA (Principal Component Analysis) to unsupervised neural networks.
Clustering − This model is used for grouping unlabeled data.
Cross Validation − It is used to check the accuracy of supervised models on unseen data (a short sketch follows this list).
Dimensionality Reduction − It is used for reducing the number of attributes in data which can be further used for summarisation, visualisation and feature selection.
Ensemble methods − As the name suggests, it is used for combining the predictions of multiple supervised models.
Feature extraction − It is used to extract features from data, for example to define the attributes in image and text data.
Feature selection − It is used to identify useful attributes to create supervised models.
Open Source − It is an open-source library and is also commercially usable under the BSD license.
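To see the consistent interface and cross validation in action, here is a minimal sketch using scikit-learn's cross_val_score helper; the iris dataset and a k-NN classifier are chosen here purely for illustration −

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Load a small built-in dataset and pick any estimator
X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross validation: train on 4 folds, score on the held-out fold, 5 times
scores = cross_val_score(knn, X, y, cv=5)
print("Mean CV accuracy:", scores.mean())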
Important features of scikit-learn:
- Simple and efficient tools for data mining and data analysis. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
- Accessible to everybody and reusable in various contexts.
- Built on top of NumPy, SciPy, and matplotlib.
- Open source, commercially usable – BSD license.
In this article, we are going to see how we can easily build a machine learning model using scikit-learn.
Installation:
The latest version of Scikit-learn is 1.1 and it requires Python 3.8 or newer.
Scikit-learn requires the following as its dependencies:
- NumPy
- SciPy
Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:
pip install -U scikit-learn
Let us get started with the modeling process now.
Step 1: Load a dataset
A dataset is nothing but a collection of data. A dataset generally has two main components:
- Features: (also known as predictors, inputs, or attributes) These are simply the variables of our data. There can be more than one, so they are represented by a feature matrix (‘X’ is a common notation for the feature matrix). The list of all the feature names is termed the feature names.
- Response: (also known as the target, label, or output) This is the output variable, which depends on the feature variables. We generally have a single response column, represented by a response vector (‘y’ is a common notation for the response vector). All the possible values taken by the response vector are termed the target names.
Loading an exemplar dataset: scikit-learn comes loaded with a few example datasets, like the iris and digits datasets for classification and the Boston house prices dataset for regression.
Given below is an example of how one can load an exemplar dataset:
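A minimal sketch of this step, which loads the bundled iris dataset and prints the information shown in the output below −

from sklearn.datasets import load_iris

# Load the bundled iris dataset
iris = load_iris()

# Feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# Names of the features and of the target classes
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)

# X is a NumPy array; inspect its type and its first 5 rows
print("Type of X is:", type(X))
print("First 5 rows of X:\n", X[:5])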
Output:
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']
Type of X is: <type 'numpy.ndarray'>
First 5 rows of X:
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
Loading external dataset: Now, consider the case when we want to load an external dataset. For this purpose, we can use the pandas library for easily loading and manipulating datasets.
To install pandas, use the following pip command:
pip install pandas
In pandas, important data types are:
Series: A one-dimensional labeled array capable of holding any data type.
DataFrame: It is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
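For a quick feel of both structures, here is a small illustrative example (the values are made up) −

import pandas as pd

# Series: a one-dimensional labeled array
humidity = pd.Series([85, 70, 90], index=['mon', 'tue', 'wed'], name='Humidity')

# DataFrame: a 2-dimensional labeled table, like a dict of Series sharing an index
df = pd.DataFrame({'Outlook': ['sunny', 'rainy', 'overcast'],
                   'Temperature': [85, 70, 83]})

print(humidity)
print(df)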
Note: The CSV file used in the example below can be downloaded from here: weather.csv
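A minimal sketch of this step using pandas; it assumes, as the output below suggests, that 'Play' is the last column of weather.csv and serves as the response −

import pandas as pd

# Read the CSV file into a DataFrame
data = pd.read_csv('weather.csv')

# Shape of the dataset and its column names
print("Shape:", data.shape)
print("Features:", data.columns)

# Split into feature matrix X (all columns but the last) and response vector y
X = data[data.columns[:-1]]
y = data[data.columns[-1]]

print("Feature matrix:\n", X.head())
print("Response vector:\n", y.head())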
Output:
Shape: (14, 5)
Features: Index([u'Outlook', u'Temperature', u'Humidity', u'Windy', u'Play'], dtype='object')
Feature matrix:
    Outlook Temperature Humidity  Windy
0  overcast         hot     high  False
1  overcast        cool   normal   True
2  overcast        mild     high   True
3  overcast         hot   normal  False
4     rainy        mild     high  False
Response vector:
0    yes
1    yes
2    yes
3    yes
4    yes
Name: Play, dtype: object
Step 2: Splitting the dataset
One important aspect of all machine learning models is determining their accuracy. One way to do that is to train the model on the given dataset and then predict the response values for the same dataset using that model, comparing the predictions against the known responses.
But this method has several flaws:
- The goal is to estimate the likely performance of a model on out-of-sample data, and training accuracy does not measure that.
- Maximizing training accuracy rewards overly complex models that won’t necessarily generalize.
- Unnecessarily complex models may overfit the training data.
A better option is to split our data into two parts: the first one for training our machine learning model, and the second one for testing our model.
To summarize:
- Split the dataset into two pieces: a training set and a testing set.
- Train the model on the training set.
- Test the model on the testing set, and evaluate how well our model did.
Advantages of train/test split:
- The model is tested on data different from the data used for training.
- Response values are known for the test dataset, hence predictions can be evaluated.
- Testing accuracy is a better estimate than training accuracy of out-of-sample performance.
Consider the example below:
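A minimal sketch of the split on the iris dataset; the test_size and random_state values are chosen to match the shapes in the output below −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the feature matrix X and response vector y
X, y = load_iris(return_X_y=True)

# Hold out 40% of the rows as a test set; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Shapes of the four resulting pieces
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)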
Output:
(90L, 4L)
(60L, 4L)
(90L,)
(60L,)
The train_test_split function takes several arguments which are explained below:
- X, y: These are the feature matrix and response vector which need to be split.
- test_size: It is the ratio of test data to the given data. For example, setting test_size = 0.4 for 150 rows of X produces test data of 150 x 0.4 = 60 rows.
- random_state: If you use random_state = some_number, you can guarantee that your split will always be the same. This is useful when you want reproducible results, for example when testing for consistency in the documentation (so that everybody sees the same numbers).
Step 3: Training the model
Now, it’s time to train some prediction models using our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a unified/consistent interface for fitting, predicting, measuring accuracy, etc.
The example given below uses KNN (K nearest neighbors) classifier.
Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.
Now, consider the example below:
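A minimal sketch of this step, continuing from the train/test split above −

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load and split the iris dataset as in the previous step
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Train a k-nearest-neighbours classifier with k = 3
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on the held-out test set and measure accuracy
y_pred = knn.predict(X_test)
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))

# Predict for new, out-of-sample observations
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)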
Output:
kNN model accuracy: 0.983333333333
Predictions: ['versicolor', 'virginica']
Important points to note from the above code:
- We create a knn classifier object using:
knn = KNeighborsClassifier(n_neighbors=3)
- The classifier is trained using X_train data. The process is termed fitting. We pass the feature matrix and the corresponding response vector.
knn.fit(X_train, y_train)
- Now, we need to test our classifier on the X_test data. knn.predict method is used for this purpose. It returns the predicted response vector, y_pred.
y_pred = knn.predict(X_test)
- Now, we are interested in finding the accuracy of our model by comparing y_test and y_pred. This is done using the metrics module’s method accuracy_score:
print(metrics.accuracy_score(y_test, y_pred))
- Consider the case when you want your model to make predictions on out-of-sample data. The sample input can then simply be passed in the same way as we pass any feature matrix.
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
- If you do not want to retrain your classifier again and again and would rather reuse a pre-trained classifier, you can save it using joblib. All you need to do is:
joblib.dump(knn, 'iris_knn.pkl')
- In case you want to load an already saved classifier, use the following method:
knn = joblib.load('iris_knn.pkl')
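Put together, a minimal sketch of persisting and reloading the classifier; this assumes the standalone joblib package, which is among scikit-learn's dependencies, and a trained knn object as above −

import joblib

# Persist the trained classifier to disk...
joblib.dump(knn, 'iris_knn.pkl')

# ...and later load it back without retraining
knn = joblib.load('iris_knn.pkl')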
As we approach the end of this article, here are some benefits of using scikit-learn over other machine learning libraries (like R libraries):
- Consistent interface to machine learning models
- Provides many tuning parameters but with sensible defaults
- Exceptional documentation
- Rich set of functionality for companion tasks
- Active community for development and support